Abstract. This paper presents a theoretical analysis of sample selection bias correction.
The sample bias correction technique commonly used in machine learning consists of
reweighting the cost of an error on each training point of a biased sample to more closely
reflect the unbiased distribution. This relies on weights derived by various estimation
techniques based on finite samples. We analyze the effect of an error in that estimation on
the accuracy of the hypothesis returned by the learning algorithm for two estimation
techniques: a cluster-based estimation technique and kernel mean matching. We also report
the results of sample bias correction experiments with several data sets using these
techniques. Our analysis is based on the novel concept of distributional stability, which
generalizes the existing concept of point-based stability. Much of our work and proof
techniques can be used to analyze other importance weighting techniques and their effect on
accuracy when using a distributionally stable algorithm.



1 Introduction 

In the standard formulation of machine learning problems, the learning algorithm re- 
ceives training and test samples drawn according to the same distribution. However, 
this assumption often does not hold in practice. The training sample available is bi- 
ased in some way, which may be due to a variety of practical reasons such as the cost 
of data labeling or acquisition. The problem occurs in many areas such as astronomy, 
econometrics, and species habitat modeling. 

In a common instance of this problem, points are drawn according to the test dis- 
tribution but not all of them are made available to the learner. This is called the sample 
selection bias problem. Remarkably, it is often possible to correct this bias by using 
large amounts of unlabeled data. 

The problem of sample selection bias correction for linear regression has been extensively
studied in econometrics and statistics (Heckman, 1979; Little & Rubin, 1986) with the
pioneering work of Heckman (1979). Several recent machine learning publications (Elkan,
2001; Zadrozny, 2004; Zadrozny et al., 2003; Fan et al., 2005; Dudík et al., 2006) have
also dealt with this problem. The main correction technique used in all of these
publications consists of reweighting the cost of training point errors to more closely
reflect that of the test distribution. This is in fact a technique commonly used in
statistics and machine learning for a variety of problems of this type (Little & Rubin,
1986). With the exact weights, this reweighting could optimally correct the bias, but, in
practice, the weights are based on an estimate of the sampling probability from finite
data sets. Thus, it is important to determine to what extent the error in this estimation
can affect the accuracy of the hypothesis returned by the learning algorithm. To our
knowledge, this problem has not been analyzed in a general manner.

This paper gives a theoretical analysis of sample selection bias correction. Our anal- 
ysis is based on the novel concept of distributional stability which generalizes the point- 
based stability introduced and analyzed by previous authors (Devroye & Wagner, 1979; 
Kearns & Ron, 1997; Bousquet & Elisseeff, 2002). We show that large families of learn- 
ing algorithms, including all kernel-based regularization algorithms such as Support 
Vector Regression (SVR) (Vapnik, 1998) or kernel ridge regression (Saunders et al., 
1998) are distributionally stable and we give the expression of their stability coefficient 
for both the $l_1$ and $l_2$ distances.

We then analyze two commonly used sample bias correction techniques: a cluster- 
based estimation technique and kernel mean matching (KMM) (Huang et al., 2006b). 
For each of these techniques, we derive bounds on the difference of the error rate of 
the hypothesis returned by a distributionally stable algorithm when using that estima- 
tion technique versus using perfect reweighting. We briefly discuss and compare these 
bounds and also report the results of experiments with both estimation techniques for 
several publicly available machine learning data sets. Much of our work and proof tech- 
niques can be used to analyze other importance weighting techniques and their effect 
on accuracy when used in combination with a distributionally stable algorithm. 

The remaining sections of this paper are organized as follows. Section 2 describes in 
detail the sample selection bias correction technique. Section 3 introduces the concept 
of distributional stability and proves the distributional stability of kernel-based regular- 
ization algorithms. Section 4 analyzes the effect of estimation error using distribution- 
ally stable algorithms for both the cluster-based and the KMM estimation techniques. 
Section 5 reports the results of experiments with several data sets comparing these esti- 
mation techniques. 

2 Sample Selection Bias Correction 
2.1 Problem 

Let $X$ denote the input space and $Y$ the label set, which may be $\{0, 1\}$ in
classification or a subset of $\mathbb{R}$ in regression estimation problems, and let
$\mathcal{D}$ denote the true distribution over $X \times Y$ according to which test points
are drawn. In the sample selection bias problem, some pairs $z = (x, y)$ drawn according to
$\mathcal{D}$ are not made available to the learning algorithm. The learning algorithm
receives a training sample $S$ of $m$ labeled points $z_1, \ldots, z_m$ drawn according to a
biased distribution $\mathcal{D}'$ over $X \times Y$. This sample bias can be represented by
a random variable $s$ taking values in $\{0, 1\}$: when $s = 1$ the point is sampled,
otherwise it is not. Thus, by definition of the sample selection bias, the support of the
biased distribution $\mathcal{D}'$ is included in that of the true distribution $\mathcal{D}$.

As in standard learning scenarios, the objective of the learning algorithm is to select
a hypothesis $h$ out of a hypothesis set $H$ with a small generalization error $R(h)$ with
respect to the true distribution $\mathcal{D}$, $R(h) = E_{(x,y) \sim \mathcal{D}}[c(h, z)]$,
where $c(h, z)$ is the cost of the error of $h$ on point $z \in X \times Y$.

While the sample $S$ is collected in some biased manner, it is often possible to derive
some information about the nature of the bias. This can be done by exploiting large
amounts of unlabeled data drawn according to the true distribution $\mathcal{D}$, which is
often available in practice. Thus, in the following let $U$ be a sample drawn according to
$\mathcal{D}$ and $S \subseteq U$ a labeled but biased sub-sample.



2.2 Weighted Samples 

A weighted sample $S_w$ is a training sample $S$ of $m$ labeled points, $z_1, \ldots, z_m$
drawn i.i.d. from $X \times Y$, that is augmented with a non-negative weight $w_i \geq 0$
for each point $z_i$. This weight is used to emphasize or de-emphasize the cost of an error
on $z_i$ as in the so-called importance weighting or cost-sensitive learning (Elkan, 2001;
Zadrozny et al., 2003). One could use the weights $w_i$ to derive an equivalent unweighted
sample $S'$ where the multiplicity of $z_i$ would reflect its weight $w_i$, but most
learning algorithms, e.g., decision trees, logistic regression, AdaBoost, Support Vector
Machines (SVMs), kernel ridge regression, can directly accept a weighted sample $S_w$. We
will refer to algorithms that can directly take $S_w$ as input as weight-sensitive
algorithms.
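The empirical error of a hypothesis $h$ on a weighted sample $S_w$ is defined as

$$\widehat{R}_w(h) = \sum_{i=1}^{m} w_i\, c(h, z_i). \tag{1}$$

Proposition 1. Let $\mathcal{D}'$ be a distribution whose support coincides with that of
$\mathcal{D}$ and let $S_w$ be a weighted sample with
$w_i = \Pr_{\mathcal{D}}(z_i) / \Pr_{\mathcal{D}'}(z_i)$ for all points $z_i$ in $S$. Then,

$$E_{S \sim \mathcal{D}'}[\widehat{R}_w(h)] = R(h) = E_{z \sim \mathcal{D}}[c(h, z)]. \tag{2}$$

Proof. Since the sample points are drawn i.i.d., and with the weighted sum in (1)
normalized by the sample size $m$,

$$E_{S \sim \mathcal{D}'}[\widehat{R}_w(h)] = \frac{1}{m} \sum_{i=1}^{m} E_{S \sim \mathcal{D}'}[w_i\, c(h, z_i)] = E_{z_1 \sim \mathcal{D}'}[w_1\, c(h, z_1)]. \tag{3}$$

By definition of $w$ and the fact that the supports of $\mathcal{D}$ and $\mathcal{D}'$
coincide, the right-hand side can be rewritten as follows

$$\sum_{\mathcal{D}'(z_1) \neq 0} \frac{\Pr_{\mathcal{D}}(z_1)}{\Pr_{\mathcal{D}'}(z_1)}\, c(h, z_1)\, \Pr_{\mathcal{D}'}(z_1) = \sum_{\mathcal{D}(z_1) \neq 0} \Pr_{\mathcal{D}}(z_1)\, c(h, z_1). \tag{4}$$

This last term is the definition of the generalization error $R(h)$. □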
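The identity of Proposition 1 can be checked exactly on a small discrete domain. The sketch below is only an illustration: the three-point domain, the two distributions, and the cost values are hypothetical, and the helper names `importance_weights` and `expectation` are not from the paper.

```python
# Exact check of the reweighting identity on a finite domain (illustrative values).

def importance_weights(p_true, p_biased):
    """Return w(z) = Pr_D(z) / Pr_D'(z) for every z in the common support."""
    return {z: p_true[z] / p_biased[z] for z in p_true}

def expectation(dist, f):
    """E_{z ~ dist}[f(z)] for a finite distribution given as a dict."""
    return sum(prob * f(z) for z, prob in dist.items())

# Hypothetical true distribution D and biased distribution D' with equal support.
p_true = {"a": 0.5, "b": 0.3, "c": 0.2}
p_biased = {"a": 0.2, "b": 0.3, "c": 0.5}

# Hypothetical cost c(h, z) of one fixed hypothesis h on each point.
cost = {"a": 1.0, "b": 0.25, "c": 4.0}

w = importance_weights(p_true, p_biased)

true_risk = expectation(p_true, lambda z: cost[z])              # R(h)
weighted_risk = expectation(p_biased, lambda z: w[z] * cost[z])  # E_{D'}[w c]
unweighted_risk = expectation(p_biased, lambda z: cost[z])       # biased estimate
```

With these values the weighted expectation under the biased distribution coincides with the true risk, while the unweighted one does not.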



2.3 Bias Correction 

The probability of drawing $z = (x, y)$ according to the true but unobserved distribution
$\mathcal{D}$ can be straightforwardly related to the observed distribution $\mathcal{D}'$.
By definition of the random variable $s$, the observed biased distribution $\mathcal{D}'$
can be expressed by $\Pr_{\mathcal{D}'}[z] = \Pr_{\mathcal{D}}[z \mid s = 1]$. We will
assume that all points $z$ in the support of $\mathcal{D}$ can be sampled with a non-zero
probability so the supports of $\mathcal{D}$ and $\mathcal{D}'$ coincide. Thus for all
$z \in X \times Y$, $\Pr[s = 1 \mid z] \neq 0$. Then, by the Bayes formula, for all $z$ in
the support of $\mathcal{D}$,

$$\Pr_{\mathcal{D}}[z] = \frac{\Pr[z \mid s = 1]\, \Pr[s = 1]}{\Pr[s = 1 \mid z]} = \frac{\Pr[s = 1]}{\Pr[s = 1 \mid z]}\, \Pr_{\mathcal{D}'}[z]. \tag{5}$$

Thus, if we were given the probabilities $\Pr[s = 1]$ and $\Pr[s = 1 \mid z]$, we could
derive the true probability $\Pr_{\mathcal{D}}$ from the biased one $\Pr_{\mathcal{D}'}$
exactly and correct the sample selection bias.
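The Bayes relation above can be verified numerically. The following is a minimal sketch on a hypothetical three-point domain: it builds the biased distribution from an assumed true distribution and assumed selection probabilities, then inverts the bias. The function names and all numeric values are illustrative assumptions.

```python
def biased_distribution(p_true, p_select):
    """Return (Pr_D'[z], Pr[s=1]) given Pr_D[z] and Pr[s=1|z]."""
    p_s = sum(p_true[z] * p_select[z] for z in p_true)            # Pr[s = 1]
    p_biased = {z: p_true[z] * p_select[z] / p_s for z in p_true}  # Pr_D[z | s = 1]
    return p_biased, p_s

def recover_true(p_biased, p_select, p_s):
    """Invert the bias: Pr_D[z] = Pr[s=1] / Pr[s=1|z] * Pr_D'[z]."""
    return {z: p_s / p_select[z] * p_biased[z] for z in p_biased}

# Hypothetical true distribution and selection probabilities (all non-zero).
p_true = {"a": 0.5, "b": 0.3, "c": 0.2}
p_select = {"a": 0.9, "b": 0.5, "c": 0.1}

p_biased, p_s = biased_distribution(p_true, p_select)
p_recovered = recover_true(p_biased, p_select, p_s)
```

The recovered distribution matches the true one exactly, which is the sense in which knowing $\Pr[s = 1]$ and $\Pr[s = 1 \mid z]$ suffices to correct the bias.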

It is important to note that this correction is only needed for the training sample $S$,
since it is the only source of selection bias. With a weight-sensitive algorithm, it suffices
to reweight each sample $z_i$ with the weight $w_i = \frac{\Pr[s = 1]}{\Pr[s = 1 \mid z_i]}$.
Thus, $\Pr[s = 1 \mid z]$ need not be estimated for all points $z$ but only for those falling
in the training sample $S$. By Proposition 1, the expected value of the empirical error after
reweighting is the same as if we were given samples from the true distribution and the usual
generalization bounds hold for $\widehat{R}_w(h)$ and $R(h)$.

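When the sampling probability is independent of the labels, as is commonly assumed in
many settings (Zadrozny, 2004; Zadrozny et al., 2003), $\Pr[s = 1 \mid z] = \Pr[s = 1 \mid x]$,
and Equation 5 can be rewritten as

$$\Pr_{\mathcal{D}}[z] = \frac{\Pr[s = 1]}{\Pr[s = 1 \mid x]}\, \Pr_{\mathcal{D}'}[z]. \tag{6}$$
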
In that case, the probabilities $\Pr[s = 1]$ and $\Pr[s = 1 \mid x]$ needed to reconstitute
$\Pr_{\mathcal{D}}$ from $\Pr_{\mathcal{D}'}$ do not depend on the labels and thus can be
estimated using the unlabeled points in $U$. Moreover, as already mentioned, for
weight-sensitive algorithms, it suffices to estimate $\Pr[s = 1 \mid x_i]$ for the points
$x_i$ of the training data; no generalization is needed.

A simple case is when the points are defined over a discrete set.¹ $\Pr[s = 1 \mid x]$ can
then be estimated from the frequency $m_x / n_x$, where $m_x$ denotes the number of times
$x$ appeared in $S \subseteq U$ and $n_x$ the number of times $x$ appeared in the full data
set $U$. $\Pr[s = 1]$ can be estimated by the quantity $|S| / |U|$. However, since
$\Pr[s = 1]$ is a constant independent of $x$, its estimation is not even necessary.
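The frequency estimate for a discrete set can be sketched as follows. This is an illustrative example only: the toy samples `U` and `S` and the function name `frequency_weights` are assumptions, and the weights are normalized to sum to one so that the constant $\Pr[s = 1]$ can be dropped.

```python
from collections import Counter

def frequency_weights(labeled_xs, all_xs):
    """Estimate Pr[s=1|x] by m_x / n_x and return normalized weights.

    labeled_xs: inputs of the biased training sample S (a subset of all_xs).
    all_xs: inputs of the full unlabeled sample U.
    """
    m = Counter(labeled_xs)                      # m_x: occurrences of x in S
    n = Counter(all_xs)                          # n_x: occurrences of x in U
    p_hat = {x: m[x] / n[x] for x in m}          # estimate of Pr[s=1|x]
    raw = [1.0 / p_hat[x] for x in labeled_xs]   # Pr[s=1] is constant, dropped
    total = sum(raw)
    return p_hat, [r / total for r in raw]       # weights sum to one

# Toy data: U contains "a" and "b" four times each and "c" twice.
U = ["a"] * 4 + ["b"] * 4 + ["c"] * 2
S = ["a", "a", "a", "b", "c"]
p_hat, weights = frequency_weights(S, U)
```

In this toy case the total weight carried by each distinct value in `S` equals its relative frequency in `U` (0.4, 0.4, 0.2), even though `S` over-represents "a".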

If the estimation of the sampling probability $\Pr[s = 1 \mid x]$ from the unlabeled data
set $U$ were exact, then the reweighting just discussed could correct the sample bias
optimally. Several techniques have been commonly used to estimate the reweighting
quantities, but these estimated weights are not guaranteed to be exact. The next section
addresses how the error in that estimation affects the error rate of the hypothesis
returned by the learning algorithm.

3 Distributional Stability 

Here, we will examine the effect on the error of the hypothesis returned by the learning
algorithm in response to a change in the way the training points are weighted. Since the 
weights are non-negative, we can assume that they are normalized and define a distribu- 
tion over the training sample. This study can be viewed as a generalization of stability 
analysis where a single sample point is changed (Devroye & Wagner, 1979; Kearns 
& Ron, 1997; Bousquet & Elisseeff, 2002) to the more general case of distributional 
stability where the sample's weight distribution is changed. 

Thus, in this section the sample weight $\mathcal{W}$ of $S_{\mathcal{W}}$ defines a
distribution over $S$. For a fixed learning algorithm $L$ and a fixed sample $S$, we will
denote by $h_{\mathcal{W}}$ the hypothesis returned by $L$ for the weighted sample
$S_{\mathcal{W}}$. We will denote by $d(\mathcal{W}, \mathcal{W}')$ a divergence measure for
two distributions $\mathcal{W}$ and $\mathcal{W}'$. There are many standard measures for the
divergences or distances between two distributions, including the relative entropy, the
Hellinger distance, and the $l_p$ distance.

¹ This can be the result of a quantization or clustering technique, as discussed later.

Definition 1 (Distributional $\beta$-Stability). A learning algorithm $L$ is said to be
distributionally $\beta$-stable for the divergence measure $d$ if for any two weighted
samples $S_{\mathcal{W}}$ and $S_{\mathcal{W}'}$,

$$\forall z \in X \times Y, \quad |c(h_{\mathcal{W}}, z) - c(h_{\mathcal{W}'}, z)| \leq \beta\, d(\mathcal{W}, \mathcal{W}'). \tag{7}$$

Thus, an algorithm is distributionally stable when small changes to a weighted sample's
distribution, as measured by a divergence $d$, result in a small change in the cost of an
error at any point. The following proposition follows directly from the definition of
distributional stability.

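Proposition 2. Let $L$ be a distributionally $\beta$-stable algorithm and let
$h_{\mathcal{W}}$ (resp. $h_{\mathcal{W}'}$) denote the hypothesis returned by $L$ when
trained on the weighted sample $S_{\mathcal{W}}$ (resp. $S_{\mathcal{W}'}$). Let
$\mathcal{W}_T$ denote the distribution according to which test points are drawn. Then, the
following holds

$$|R(h_{\mathcal{W}}) - R(h_{\mathcal{W}'})| \leq \beta\, d(\mathcal{W}, \mathcal{W}'). \tag{8}$$

Proof. By the distributional stability of the algorithm,

$$E_{z \sim \mathcal{W}_T}\big[|c(z, h_{\mathcal{W}}) - c(z, h_{\mathcal{W}'})|\big] \leq \beta\, d(\mathcal{W}, \mathcal{W}'), \tag{9}$$

which implies the statement of the proposition. □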



3.1 Distributional Stability of Kernel-Based Regularization Algorithms 

Here, we show that kernel-based regularization algorithms are distributionally
$\beta$-stable. This family of algorithms includes, among others, Support Vector Regression
(SVR) and kernel ridge regression. Other algorithms such as those based on the relative
entropy regularization can be shown to be distributionally $\beta$-stable in a similar way
as for point-based stability. Our results also apply to classification algorithms such as
Support Vector Machines (SVMs) (Cortes & Vapnik, 1995) using a margin-based loss function as
in (Bousquet & Elisseeff, 2002).

We will assume that the cost function $c$ is $\sigma$-admissible, that is, there exists
$\sigma \in \mathbb{R}_+$ such that for any two hypotheses $h, h' \in H$ and for all
$z = (x, y) \in X \times Y$,

$$|c(h, z) - c(h', z)| \leq \sigma\, |h(x) - h'(x)|. \tag{10}$$

This assumption holds for the quadratic cost and most other cost functions when the
hypothesis set and the set of output labels are bounded by some $M \in \mathbb{R}_+$:
$\forall h \in H, \forall x \in X, |h(x)| \leq M$ and $\forall y \in Y, |y| \leq M$. We will
also assume that $c$ is differentiable. This assumption is in fact not necessary and all of
our results hold without it, but it makes the presentation simpler.

Let $N : H \to \mathbb{R}_+$ be a function defined over the hypothesis set.
Regularization-based algorithms minimize an objective of the form

$$F_{\mathcal{W}}(h) = \widehat{R}_{\mathcal{W}}(h) + \lambda N(h), \tag{11}$$

where $\lambda \geq 0$ is a trade-off parameter. We denote by $B_F$ the Bregman divergence
associated to a convex function $F$,
$B_F(f \| g) = F(f) - F(g) - \langle f - g, \nabla F(g) \rangle$, and define $\Delta h$ as
$\Delta h = h' - h$.

Lemma 1. Let the hypothesis set $H$ be a vector space. Assume that $N$ is a proper
closed convex function and that $N$ is differentiable. Assume that $F_{\mathcal{W}}$ admits
a minimizer $h \in H$ and $F_{\mathcal{W}'}$ a minimizer $h' \in H$. Then, the following
bound holds,

$$B_N(h' \| h) + B_N(h \| h') \leq \frac{\sigma\, l_1(\mathcal{W}, \mathcal{W}')}{\lambda} \sup_{x \in S} |\Delta h(x)|. \tag{12}$$

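Proof. Since $B_{F_{\mathcal{W}}} = B_{\widehat{R}_{\mathcal{W}}} + \lambda B_N$ and
$B_{F_{\mathcal{W}'}} = B_{\widehat{R}_{\mathcal{W}'}} + \lambda B_N$, and a Bregman
divergence is non-negative,
$\lambda \big( B_N(h' \| h) + B_N(h \| h') \big) \leq B_{F_{\mathcal{W}}}(h' \| h) + B_{F_{\mathcal{W}'}}(h \| h')$.
By the definition of $h$ and $h'$ as the minimizers of $F_{\mathcal{W}}$ and
$F_{\mathcal{W}'}$,

$$B_{F_{\mathcal{W}}}(h' \| h) + B_{F_{\mathcal{W}'}}(h \| h') = \widehat{R}_{\mathcal{W}}(h') - \widehat{R}_{\mathcal{W}}(h) + \widehat{R}_{\mathcal{W}'}(h) - \widehat{R}_{\mathcal{W}'}(h'). \tag{13}$$

Thus, by the $\sigma$-admissibility of the cost function $c$, using the notation
$\mathcal{W}_i = \mathcal{W}(x_i)$ and $\mathcal{W}'_i = \mathcal{W}'(x_i)$,

$$\begin{aligned}
\lambda \big( B_N(h' \| h) + B_N(h \| h') \big)
&\leq \widehat{R}_{\mathcal{W}}(h') - \widehat{R}_{\mathcal{W}}(h) + \widehat{R}_{\mathcal{W}'}(h) - \widehat{R}_{\mathcal{W}'}(h') \\
&= \sum_{i=1}^{m} \big[ c(h', z_i) \mathcal{W}_i - c(h, z_i) \mathcal{W}_i + c(h, z_i) \mathcal{W}'_i - c(h', z_i) \mathcal{W}'_i \big] \\
&= \sum_{i=1}^{m} \big( c(h', z_i) - c(h, z_i) \big) \big( \mathcal{W}_i - \mathcal{W}'_i \big) \\
&\leq \sum_{i=1}^{m} \sigma\, |\Delta h(x_i)|\, |\mathcal{W}_i - \mathcal{W}'_i| \\
&\leq \sigma\, l_1(\mathcal{W}, \mathcal{W}') \sup_{x \in S} |\Delta h(x)|,
\end{aligned} \tag{14}$$

which establishes the lemma. □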



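Given $x_1, \ldots, x_m \in X$ and a positive definite symmetric (PDS) kernel $K$, we denote
by $\mathbf{K} \in \mathbb{R}^{m \times m}$ the kernel matrix defined by
$\mathbf{K}_{ij} = K(x_i, x_j)$ and by $\lambda_{\max}(\mathbf{K}) \in \mathbb{R}_+$ the
largest eigenvalue of $\mathbf{K}$.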

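Lemma 2. Let $H$ be a reproducing kernel Hilbert space with kernel $K$ and let the
regularization function $N$ be defined by $N(\cdot) = \|\cdot\|_K^2$. Then, the following
bound holds,

$$B_N(h' \| h) + B_N(h \| h') \leq \frac{\sigma\, \lambda_{\max}^{1/2}(\mathbf{K})\, l_2(\mathcal{W}, \mathcal{W}')}{\lambda}\, \|\Delta h\|_K. \tag{15}$$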

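Proof. As in the proof of Lemma 1,

$$\lambda \big( B_N(h' \| h) + B_N(h \| h') \big) \leq \sum_{i=1}^{m} \big( c(h', z_i) - c(h, z_i) \big) \big( \mathcal{W}_i - \mathcal{W}'_i \big). \tag{16}$$

By definition of a reproducing kernel Hilbert space $H$, for any hypothesis $h \in H$,
$\forall x \in X$, $h(x) = \langle h, K(x, \cdot) \rangle$ and thus also for any
$\Delta h = h' - h$ with $h, h' \in H$, $\forall x \in X$,
$\Delta h(x) = \langle \Delta h, K(x, \cdot) \rangle$. Let $\Delta \mathcal{W}_i$ denote
$\mathcal{W}'_i - \mathcal{W}_i$, $\Delta \mathcal{W}$ the vector whose components are the
$\Delta \mathcal{W}_i$'s, and let $V$ denote $B_N(h' \| h) + B_N(h \| h')$. Using
$\sigma$-admissibility,
$\lambda V \leq \sigma \sum_{i=1}^{m} |\Delta h(x_i)\, \Delta \mathcal{W}_i| = \sigma \sum_{i=1}^{m} |\langle \Delta h, \Delta \mathcal{W}_i K(x_i, \cdot) \rangle|$.
Let $\epsilon_i \in \{-1, +1\}$ denote the sign of
$\langle \Delta h, \Delta \mathcal{W}_i K(x_i, \cdot) \rangle$ and let
$\Delta \mathcal{W} \epsilon$ denote the vector with components
$\epsilon_i \Delta \mathcal{W}_i$. Then,

$$\begin{aligned}
\lambda V &\leq \sigma \Big\langle \Delta h, \sum_{i=1}^{m} \epsilon_i \Delta \mathcal{W}_i K(x_i, \cdot) \Big\rangle
\leq \sigma \|\Delta h\|_K \Big\| \sum_{i=1}^{m} \epsilon_i \Delta \mathcal{W}_i K(x_i, \cdot) \Big\|_K \\
&= \sigma \|\Delta h\|_K \Big( \sum_{i,j=1}^{m} \epsilon_i \epsilon_j \Delta \mathcal{W}_i \Delta \mathcal{W}_j K(x_i, x_j) \Big)^{1/2} \\
&= \sigma \|\Delta h\|_K \big[ (\Delta \mathcal{W} \epsilon)^\top \mathbf{K} (\Delta \mathcal{W} \epsilon) \big]^{1/2}
\leq \sigma \|\Delta h\|_K \|\Delta \mathcal{W}\|_2\, \lambda_{\max}^{1/2}(\mathbf{K}).
\end{aligned} \tag{17}$$

In this derivation, the second inequality follows from the Cauchy-Schwarz inequality
and the last inequality from the standard property of the Rayleigh quotient for PDS
matrices. Since $\|\Delta \mathcal{W}\|_2 = l_2(\mathcal{W}, \mathcal{W}')$, dividing by
$\lambda$ proves the lemma. □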

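Theorem 1. Let $K$ be a kernel such that $K(x, x) \leq \kappa < \infty$ for all $x \in X$.
Then, the regularization algorithm based on $N(\cdot) = \|\cdot\|_K^2$ is distributionally
$\beta$-stable for the $l_1$ distance with $\beta \leq \frac{\sigma^2 \kappa^2}{2 \lambda}$,
and for the $l_2$ distance with
$\beta \leq \frac{\sigma^2 \kappa\, \lambda_{\max}^{1/2}(\mathbf{K})}{2 \lambda}$.

Proof. For $N(\cdot) = \|\cdot\|_K^2$, we have $B_N(h' \| h) = \|h' - h\|_K^2$, thus
$B_N(h' \| h) + B_N(h \| h') = 2 \|\Delta h\|_K^2$ and by Lemma 1,

$$2 \|\Delta h\|_K^2 \leq \frac{\sigma\, l_1(\mathcal{W}, \mathcal{W}')}{\lambda} \sup_{x \in S} |\Delta h(x)| \leq \frac{\sigma\, l_1(\mathcal{W}, \mathcal{W}')}{\lambda}\, \kappa\, \|\Delta h\|_K. \tag{18}$$

Thus $\|\Delta h\|_K \leq \frac{\sigma \kappa\, l_1(\mathcal{W}, \mathcal{W}')}{2 \lambda}$.
By $\sigma$-admissibility of $c$,

$$\forall z \in X \times Y, \quad |c(h', z) - c(h, z)| \leq \sigma |\Delta h(x)| \leq \kappa \sigma \|\Delta h\|_K. \tag{19}$$

Therefore,

$$\forall z \in X \times Y, \quad |c(h', z) - c(h, z)| \leq \frac{\sigma^2 \kappa^2\, l_1(\mathcal{W}, \mathcal{W}')}{2 \lambda}, \tag{20}$$

which shows the distributional stability of a kernel-based regularization algorithm for
the $l_1$ distance. Using Lemma 2, a similar derivation leads to

$$\forall z \in X \times Y, \quad |c(h', z) - c(h, z)| \leq \frac{\sigma^2 \kappa\, \lambda_{\max}^{1/2}(\mathbf{K})\, l_2(\mathcal{W}, \mathcal{W}')}{2 \lambda}, \tag{21}$$

which shows the distributional stability of a kernel-based regularization algorithm for
the $l_2$ distance. □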
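To get a feel for the two stability coefficients, the sketch below evaluates them for a small Gaussian kernel matrix. It assumes the coefficients of Theorem 1 read $\sigma^2 \kappa^2 / (2\lambda)$ for the $l_1$ distance and $\sigma^2 \kappa \lambda_{\max}^{1/2}(\mathbf{K}) / (2\lambda)$ for the $l_2$ distance. The sample points, the admissibility constant, and the regularization value are hypothetical, and the largest eigenvalue is approximated by plain power iteration, which is a convenience choice and not part of the paper.

```python
import math

def gaussian_kernel(x, y, sigma=1.0):
    """Gaussian kernel on real vectors; K(x, x) = 1, so kappa = 1 bounds the diagonal."""
    sq = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq / (2.0 * sigma ** 2))

def kernel_matrix(points, kernel):
    """Kernel matrix with entries K(x_i, x_j)."""
    return [[kernel(p, q) for q in points] for p in points]

def largest_eigenvalue(matrix, iterations=500):
    """Power iteration; adequate for a small symmetric PSD matrix."""
    m = len(matrix)
    v = [1.0 / math.sqrt(m)] * m
    lam = 0.0
    for _ in range(iterations):
        w = [sum(row[j] * v[j] for j in range(m)) for row in matrix]
        lam = math.sqrt(sum(c * c for c in w))   # norm of K v for a unit vector v
        v = [c / lam for c in w]
    return lam

def stability_coefficients(sigma_adm, kappa, lam_reg, lam_max):
    """Upper bounds on beta for the l1 and l2 distances (assumed form, see lead-in)."""
    beta_l1 = sigma_adm ** 2 * kappa ** 2 / (2.0 * lam_reg)
    beta_l2 = sigma_adm ** 2 * kappa * math.sqrt(lam_max) / (2.0 * lam_reg)
    return beta_l1, beta_l2

# Hypothetical one-dimensional sample and constants.
points = [(0.0,), (0.5,), (2.0,)]
K = kernel_matrix(points, gaussian_kernel)
lam_max = largest_eigenvalue(K)
beta_l1, beta_l2 = stability_coefficients(
    sigma_adm=1.0, kappa=1.0, lam_reg=0.1, lam_max=lam_max)
```

For a Gaussian kernel the diagonal of the kernel matrix is all ones, so the largest eigenvalue lies between 1 and $m$.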

Note that the standard setting of a sample with no weight is equivalent to a weighted
sample with the uniform distribution $\mathcal{W}_U$: each point is assigned the weight
$1/m$. Removing a single point, say $x_1$, is equivalent to assigning weight $0$ to $x_1$
and $1/(m - 1)$ to others. Let $\mathcal{W}_{U'}$ be the corresponding distribution, then

$$l_1(\mathcal{W}_U, \mathcal{W}_{U'}) = \frac{1}{m} + \sum_{i=1}^{m-1} \Big| \frac{1}{m} - \frac{1}{m-1} \Big| = \frac{2}{m}. \tag{22}$$

Thus, in the case of kernel-based regularized algorithms and for the $l_1$ distance,
standard uniform $\beta$-stability is a special case of distributional $\beta$-stability.
It can be shown similarly that
$l_2(\mathcal{W}_U, \mathcal{W}_{U'}) = \frac{1}{\sqrt{m(m-1)}}$.
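The two distances between the uniform distribution and the distribution obtained by removing one point can be checked directly. The helper names below are illustrative, and $m = 10$ is an arbitrary choice.

```python
import math

def l1_distance(w, w_prime):
    """l1 distance between two weight vectors of equal length."""
    return sum(abs(a - b) for a, b in zip(w, w_prime))

def l2_distance(w, w_prime):
    """l2 distance between two weight vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(w, w_prime)))

def leave_one_out_distances(m):
    """Distances between uniform weights and the weights after removing x_1."""
    uniform = [1.0 / m] * m
    removed = [0.0] + [1.0 / (m - 1)] * (m - 1)   # weight 0 on x_1, 1/(m-1) elsewhere
    return l1_distance(uniform, removed), l2_distance(uniform, removed)

l1, l2 = leave_one_out_distances(10)
```

For $m = 10$ this gives $l_1 = 2/m = 0.2$ and $l_2 = 1/\sqrt{m(m-1)} = 1/\sqrt{90}$, up to floating-point rounding.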

4 Effect of Estimation Error for Kernel-Based Regularization 
Algorithms 

This section analyzes the effect of an error in the estimation of the weight of a train- 
ing example on the generalization error of the hypothesis h returned by a weight- 
sensitive learning algorithm. We will examine two estimation techniques: a straight- 
forward histogram-based or cluster-based method, and kernel mean matching (KMM) 
(Huang et al., 2006b). 



4.1 Cluster-Based Estimation 

A straightforward estimate of the probability of sampling is based on the observed
empirical frequencies. The ratio of the number of times a point $x$ appears in $S$ and
the number of times it appears in $U$ is an empirical estimate of $\Pr[s = 1 \mid x]$. Note
that generalization to unseen points $x$ is not needed since reweighting requires only
assigning weights to the seen training points. However, in general, training instances
are typically unique or very infrequent since features are real-valued numbers. Instead,
features can be discretized based on a partitioning of the input space $X$. The partitioning
may be based on simple histogram buckets or the result of a clustering technique. The
analysis of this section assumes such a prior partitioning of $X$.
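As an illustration of a histogram-based partitioning, the sketch below buckets a one-dimensional real-valued feature and weights each training point by the ratio of unlabeled to labeled counts in its bucket. The bucket width, the toy data, and the function names are assumptions made for this example only.

```python
from collections import Counter

def bucket(x, width):
    """Map a real-valued feature to the index of its histogram bucket."""
    return int(x // width)

def cluster_weights(train_xs, unlabeled_xs, width):
    """Weight each training point by |C ∩ U| / |C ∩ S| for its bucket C."""
    s_counts = Counter(bucket(x, width) for x in train_xs)
    u_counts = Counter(bucket(x, width) for x in unlabeled_xs)
    return [u_counts[bucket(x, width)] / s_counts[bucket(x, width)]
            for x in train_xs]

# Toy data: S is a subset of U and under-represents the bucket [1, 2).
U = [0.1, 0.2, 0.3, 0.4, 1.1, 1.2, 1.6, 1.9]
S = [0.1, 0.2, 0.3, 1.9]
weights = cluster_weights(S, U, width=1.0)
```

Here the single training point in the under-sampled bucket receives weight 4, and the unnormalized weights sum to $|U|$.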

We shall analyze how fast the resulting empirical frequencies converge to the true
sampling probability. For $x \in U$, let $U_x$ denote the subsample of $U$ containing exactly
all the instances of $x$ and let $n = |U|$ and $n_x = |U_x|$. Furthermore, let $n'$ denote
the number of unique points in the sample $U$. Similarly, we define $S_x$, $m$, $m_x$ and
$m'$ for the set $S$. Additionally, denote by $p_0 = \min_{x \in U} \Pr[x] \neq 0$.

Lemma 3. Let $\delta > 0$. Then, with probability at least $1 - \delta$, the following
inequality holds for all $x$ in $S$:

$$\Big| \Pr[s = 1 \mid x] - \frac{m_x}{n_x} \Big| \leq \sqrt{\frac{\log 2m' + \log \frac{1}{\delta}}{p_0\, n}}. \tag{23}$$

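Proof. For a fixed $x \in U$, by Hoeffding's inequality,

$$\begin{aligned}
\Pr_U\Big[ \Big| \Pr[s = 1 \mid x] - \frac{m_x}{n_x} \Big| \geq \epsilon \Big]
&= \sum_{i=1}^{n} \Pr_x\Big[ \Big| \Pr[s = 1 \mid x] - \frac{m_x}{i} \Big| \geq \epsilon \,\Big|\, n_x = i \Big] \Pr[n_x = i] \\
&\leq \sum_{i=1}^{n} 2 e^{-2 i \epsilon^2} \Pr_U[n_x = i].
\end{aligned}$$

Since $n_x$ is a binomial random variable with parameters $\Pr_U[x] = p_x$ and $n$, this
last term can be expressed more explicitly and bounded as follows:

$$\begin{aligned}
2 \sum_{i=1}^{n} e^{-2 i \epsilon^2} \Pr_U[n_x = i]
&\leq 2 \sum_{i=0}^{n} e^{-2 i \epsilon^2} \binom{n}{i} p_x^i (1 - p_x)^{n-i}
= 2 \big( p_x e^{-2 \epsilon^2} + (1 - p_x) \big)^n \\
&= 2 \big( 1 - p_x (1 - e^{-2 \epsilon^2}) \big)^n
\leq 2 \exp\big( -p_x n (1 - e^{-2 \epsilon^2}) \big).
\end{aligned}$$

Since for $x \in [0, 1]$, $1 - e^{-x} \geq x/2$, this shows that for $\epsilon \in [0, 1]$,

$$\Pr_U\Big[ \Big| \Pr[s = 1 \mid x] - \frac{m_x}{n_x} \Big| \geq \epsilon \Big] \leq 2 e^{-p_x n \epsilon^2}. \tag{24}$$

By the union bound and the definition of $p_0$,

$$\Pr_U\Big[ \exists x \in S : \Big| \Pr[s = 1 \mid x] - \frac{m_x}{n_x} \Big| \geq \epsilon \Big] \leq 2 m' e^{-p_0 n \epsilon^2}.$$

Setting $\delta$ to match the upper bound yields the statement of the lemma. □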
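The last step of the proof, solving $\delta = 2 m' e^{-p_0 n \epsilon^2}$ for $\epsilon$, can be sanity-checked numerically. The sketch assumes the bound of Lemma 3 reads $\sqrt{(\log 2m' + \log(1/\delta)) / (p_0 n)}$. All numeric values and function names are hypothetical.

```python
import math

def frequency_error_bound(m_unique, n, p0, delta):
    """Deviation bound sqrt((log 2m' + log(1/delta)) / (p0 * n))."""
    return math.sqrt((math.log(2 * m_unique) + math.log(1.0 / delta)) / (p0 * n))

def union_bound_failure(m_unique, n, p0, eps):
    """Failure probability 2 m' exp(-p0 n eps^2) obtained from the union bound."""
    return 2 * m_unique * math.exp(-p0 * n * eps ** 2)

# Hypothetical setting: 50 distinct training points, 100000 unlabeled points,
# least frequent point with probability 0.005, confidence parameter 0.05.
eps = frequency_error_bound(m_unique=50, n=100000, p0=0.005, delta=0.05)
recovered_delta = union_bound_failure(50, 100000, 0.005, eps)
```

Plugging the resulting `eps` back into the union bound recovers the chosen confidence parameter, which confirms that the two expressions are inverses of each other.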

The following proposition bounds the distance between the distribution $\mathcal{W}$
corresponding to a perfectly reweighted sample ($S_{\mathcal{W}}$) and the one
corresponding to a sample that is reweighted according to the observed bias
($S_{\widehat{\mathcal{W}}}$). For a sampled point $x_i = x$, these distributions are
defined as follows:

$$\mathcal{W}(x_i) = \frac{1}{m} \frac{1}{p(x_i)} \quad \text{and} \quad \widehat{\mathcal{W}}(x_i) = \frac{1}{m} \frac{1}{\hat{p}(x_i)}, \tag{25}$$

where, for a distinct point $x$ equal to the sampled point $x_i$, we define
$p(x_i) = \Pr[s = 1 \mid x]$ and $\hat{p}(x_i) = \frac{m_x}{n_x}$.

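Proposition 3. Let $B = \max_{i=1,\ldots,m} \max(1/p(x_i), 1/\hat{p}(x_i))$. Then, the $l_1$
and $l_2$ distances of the distributions $\mathcal{W}$ and $\widehat{\mathcal{W}}$ can be
bounded as follows,

$$l_1(\mathcal{W}, \widehat{\mathcal{W}}) \leq B^2 \sqrt{\frac{\log 2m' + \log \frac{1}{\delta}}{p_0\, n}} \quad \text{and} \quad l_2(\mathcal{W}, \widehat{\mathcal{W}}) \leq B^2 \sqrt{\frac{\log 2m' + \log \frac{1}{\delta}}{p_0\, n\, m}}. \tag{26}$$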



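Proof. By definition of the $l_2$ distance,

$$\begin{aligned}
l_2^2(\mathcal{W}, \widehat{\mathcal{W}})
&= \frac{1}{m^2} \sum_{i=1}^{m} \Big( \frac{1}{p(x_i)} - \frac{1}{\hat{p}(x_i)} \Big)^2
= \frac{1}{m^2} \sum_{i=1}^{m} \Big( \frac{p(x_i) - \hat{p}(x_i)}{p(x_i)\, \hat{p}(x_i)} \Big)^2 \\
&\leq \frac{B^4}{m} \max_i \big( p(x_i) - \hat{p}(x_i) \big)^2.
\end{aligned}$$

It can be shown similarly that
$l_1(\mathcal{W}, \widehat{\mathcal{W}}) \leq B^2 \max_i |p(x_i) - \hat{p}(x_i)|$. The
application of the uniform convergence bound of Lemma 3 directly yields the statement of
the proposition. □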

The following theorem provides a bound on the difference between the generalization 
error of the hypothesis returned by a kernel-based regularization algorithm when trained 
on the perfectly unbiased distribution, and the one trained on the sample bias-corrected 
using frequency estimates. 

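Theorem 2. Let $K$ be a PDS kernel such that $K(x, x) \leq \kappa < \infty$ for all
$x \in X$. Let $h_{\mathcal{W}}$ be the hypothesis returned by the regularization algorithm
based on $N(\cdot) = \|\cdot\|_K^2$ using $S_{\mathcal{W}}$, and
$h_{\widehat{\mathcal{W}}}$ the one returned after training the same algorithm on
$S_{\widehat{\mathcal{W}}}$. Then, for any $\delta > 0$, with probability at least
$1 - \delta$, the difference in generalization error of these hypotheses is bounded as
follows

$$\begin{aligned}
|R(h_{\mathcal{W}}) - R(h_{\widehat{\mathcal{W}}})| &\leq \frac{\sigma^2 \kappa^2 B^2}{2 \lambda} \sqrt{\frac{\log 2m' + \log \frac{1}{\delta}}{p_0\, n}} \\
|R(h_{\mathcal{W}}) - R(h_{\widehat{\mathcal{W}}})| &\leq \frac{\sigma^2 \kappa\, \lambda_{\max}^{1/2}(\mathbf{K})\, B^2}{2 \lambda} \sqrt{\frac{\log 2m' + \log \frac{1}{\delta}}{p_0\, n\, m}}.
\end{aligned} \tag{27}$$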

Proof. The result follows from Proposition 2, the distributional stability and the bounds
on the stability coefficient $\beta$ for kernel-based regularization algorithms (Theorem 1),
and the bounds on the $l_1$ and $l_2$ distances between the correct distribution
$\mathcal{W}$ and the estimate $\widehat{\mathcal{W}}$. □

Let $n_0$ be the number of occurrences, in $U$, of the least frequent training example.
For large enough $n$, $p_0 n \approx n_0$, thus the theorem suggests that the difference of
error rate between the hypothesis returned after an optimal reweighting versus the one based
on frequency estimates goes to zero as $O\big(\sqrt{\log m' / n_0}\big)$. In practice,
$m' \leq m$, the number of distinct points in $S$, is small and, a fortiori, $\log m'$ is
very small; thus the convergence rate depends essentially on the rate at which $n_0$
increases. Additionally, if $\lambda_{\max}(\mathbf{K}) \leq m$ (such as with Gaussian
kernels), the $l_2$-based bound will provide convergence that is at least as fast.



4.2 Kernel Mean Matching 

The following definitions introduced by Steinwart (2002) will be needed for the
presentation and discussion of the kernel mean matching (KMM) technique. Let $X$ be a
compact metric space and let $C(X)$ denote the space of all continuous functions over
$X$ equipped with the standard infinity norm $\|\cdot\|_\infty$. Let
$K : X \times X \to \mathbb{R}$ be a PDS kernel. There exists a Hilbert space $F$ and a map
$\Phi : X \to F$ such that for all $x, y \in X$,
$K(x, y) = \langle \Phi(x), \Phi(y) \rangle$. Note that for a given kernel $K$, $F$ and
$\Phi$ are not unique and that, for these definitions, $F$ does not need to be a reproducing
kernel Hilbert space (RKHS).

Let $\mathcal{P}$ denote the set of all probability distributions over $X$ and let
$\mu : \mathcal{P} \to F$ be the function defined by

$$\forall p \in \mathcal{P}, \quad \mu(p) = E_{x \sim p}[\Phi(x)]. \tag{28}$$

A function $g : X \to \mathbb{R}$ is said to be induced by $K$ if there exists $w \in F$
such that for all $x \in X$, $g(x) = \langle w, \Phi(x) \rangle$. $K$ is said to be
universal if it is continuous and if the set of functions induced by $K$ is dense in $C(X)$.

Theorem 3 (Huang et al. (2006a)). Let $F$ be a separable Hilbert space and let $K$ be a
universal kernel with feature space $F$ and feature map $\Phi : X \to F$. Then, $\mu$ is
injective.

Proof. The proof given by Huang et al. (2006a) does not seem to be complete; we have
included a complete proof in the Appendix. □



The KMM technique is applicable when the learning algorithm is based on a universal 
kernel. The theorem shows that for a universal kernel, the expected value of the fea- 
ture vectors induced uniquely determines the probability distribution. KMM uses this 
property to reweight training points so that the average value of the feature vectors for 
the training data matches that of the feature vectors for a set of unlabeled points drawn 
from the unbiased distribution. 

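Let $\gamma_i$ denote the perfect reweighting of the sample point $x_i$ and
$\hat{\gamma}_i$ the estimate derived by KMM. Let $B'$ denote the largest possible
reweighting coefficient $\gamma$ and let $\epsilon$ be a positive real number. We will
assume that $\epsilon$ is chosen so that $\epsilon \leq 1/2$. Then, the following is the KMM
constraint optimization

$$\begin{aligned}
\min_{\gamma}\; & G(\gamma) = \Big\| \frac{1}{m} \sum_{i=1}^{m} \gamma_i \Phi(x_i) - \frac{1}{n} \sum_{i=1}^{n} \Phi(x'_i) \Big\| \\
\text{subject to}\; & \gamma_i \in [0, B'] \;\wedge\; \Big| \frac{1}{m} \sum_{i=1}^{m} \gamma_i - 1 \Big| \leq \epsilon,
\end{aligned} \tag{29}$$

where $x'_1, \ldots, x'_n$ denote the unlabeled points. Let $\hat{\gamma}$ be the solution
of this optimization problem, then
$\frac{1}{m} \sum_{i=1}^{m} \hat{\gamma}_i = 1 + \epsilon'$ with
$-\epsilon \leq \epsilon' \leq \epsilon$. For $i \in [1, m]$, let
$\hat{\gamma}'_i = \hat{\gamma}_i / (1 + \epsilon')$. The normalized weights used in KMM's
reweighting of the sample are thus defined by $\hat{\gamma}'_i / m$ with
$\frac{1}{m} \sum_{i=1}^{m} \hat{\gamma}'_i = 1$.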
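The KMM objective can be evaluated without forming the feature map explicitly by expanding the squared norm into kernel evaluations. The sketch below assumes the objective is the norm of the difference between the weighted training mean and the unlabeled mean in feature space, uses a Gaussian kernel as one example of a universal kernel, and shows only the objective and the final normalization step, not a solver for the constrained problem. Data and function names are hypothetical.

```python
import math

def gaussian_kernel(x, y, sigma=1.0):
    """Gaussian kernel on real vectors."""
    sq = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq / (2.0 * sigma ** 2))

def kmm_objective(gamma, train, unlabeled, kernel):
    """G(gamma) = || (1/m) sum_i gamma_i Phi(x_i) - (1/n) sum_j Phi(x'_j) ||,
    expanded with kernel evaluations so that Phi is never formed."""
    m, n = len(train), len(unlabeled)
    tt = sum(gamma[i] * gamma[j] * kernel(train[i], train[j])
             for i in range(m) for j in range(m)) / (m * m)
    tu = sum(gamma[i] * kernel(train[i], u)
             for i in range(m) for u in unlabeled) / (m * n)
    uu = sum(kernel(u, v) for u in unlabeled for v in unlabeled) / (n * n)
    return math.sqrt(max(tt - 2.0 * tu + uu, 0.0))   # clamp rounding noise

def normalize(gamma_hat):
    """Rescale so the weights average to one: gamma' = gamma_hat / (1 + eps')."""
    mean = sum(gamma_hat) / len(gamma_hat)            # equals 1 + eps'
    return [g / mean for g in gamma_hat]

# Toy data: the unlabeled sample contains 0.0 twice and 1.0 once.
train = [(0.0,), (1.0,)]
unlabeled = [(0.0,), (0.0,), (1.0,)]

g_uniform = kmm_objective([1.0, 1.0], train, unlabeled, gaussian_kernel)
g_matched = kmm_objective([4.0 / 3.0, 2.0 / 3.0], train, unlabeled, gaussian_kernel)
gamma_prime = normalize([2.0, 1.0])
```

With the weights $(4/3, 2/3)$ the weighted training mean matches the unlabeled mean and the objective vanishes up to rounding, whereas uniform weights leave a strictly positive gap.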

As in the previous section, given $x_1, \ldots, x_m \in X$ and a strictly positive
definite universal kernel $K$, we denote by $\mathbf{K} \in \mathbb{R}^{m \times m}$ the
kernel matrix defined by $\mathbf{K}_{ij} = K(x_i, x_j)$ and by
$\lambda_{\min}(\mathbf{K}) > 0$ the smallest eigenvalue of $\mathbf{K}$. We also denote
by $\mathrm{cond}(\mathbf{K})$ the condition number of the matrix $\mathbf{K}$:
$\mathrm{cond}(\mathbf{K}) = \lambda_{\max}(\mathbf{K}) / \lambda_{\min}(\mathbf{K})$.
When $K$ is universal, it is continuous over the compact $X \times X$ and thus bounded and
there exists $\kappa < \infty$ such that $K(x, x) \leq \kappa$ for all $x \in X$.

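Proposition 4. Let $K$ be a strictly positive definite universal kernel. Then, for any
$\delta > 0$, with probability at least $1 - \delta$, the $l_2$ distance of the
distributions $\hat{\gamma}'/m$ and $\gamma/m$ is bounded as follows:

$$\frac{1}{m} \|\hat{\gamma}' - \gamma\|_2 \leq \frac{2 \epsilon B'}{\sqrt{m}} + \frac{2 \kappa^{1/2}}{\lambda_{\min}^{1/2}(\mathbf{K})} \sqrt{\frac{B'^2}{m} + \frac{1}{n}} \Big( 1 + \sqrt{2 \log \frac{2}{\delta}} \Big). \tag{30}$$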

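Proof. Since the optimal reweighting $\gamma$ verifies the constraints of the
optimization, by definition of $\hat{\gamma}$ as a minimizer,
$G(\hat{\gamma}) \leq G(\gamma)$. Thus, by the triangle inequality,

$$\Big\| \frac{1}{m} \sum_{i=1}^{m} \gamma_i \Phi(x_i) - \frac{1}{m} \sum_{i=1}^{m} \hat{\gamma}_i \Phi(x_i) \Big\| \leq G(\hat{\gamma}) + G(\gamma) \leq 2 G(\gamma). \tag{31}$$

Let $L$ denote the left-hand side of this inequality:
$L = \frac{1}{m} \big\| \sum_{i=1}^{m} (\hat{\gamma}_i - \gamma_i) \Phi(x_i) \big\|$. By
definition of the norm in the Hilbert space,
$L = \frac{1}{m} \sqrt{(\hat{\gamma} - \gamma)^\top \mathbf{K} (\hat{\gamma} - \gamma)}$.
Then, by the standard bounds for the Rayleigh quotient of PDS matrices,
$L \geq \frac{1}{m} \lambda_{\min}^{1/2}(\mathbf{K}) \|\hat{\gamma} - \gamma\|_2$.
This combined with Inequality 31 yields

$$\frac{1}{m} \|\hat{\gamma} - \gamma\|_2 \leq \frac{2 G(\gamma)}{\lambda_{\min}^{1/2}(\mathbf{K})}. \tag{32}$$

Thus, by the triangle inequality,

$$\begin{aligned}
\frac{1}{m} \|\hat{\gamma}' - \gamma\|_2
&\leq \frac{1}{m} \|\hat{\gamma}' - \hat{\gamma}\|_2 + \frac{1}{m} \|\hat{\gamma} - \gamma\|_2 \\
&\leq \frac{|\epsilon'| / m}{1 + \epsilon'} \|\hat{\gamma}\|_2 + \frac{2 G(\gamma)}{\lambda_{\min}^{1/2}(\mathbf{K})} \\
&\leq \frac{2 |\epsilon'| B' \sqrt{m}}{m} + \frac{2 G(\gamma)}{\lambda_{\min}^{1/2}(\mathbf{K})}
\leq \frac{2 \epsilon B'}{\sqrt{m}} + \frac{2 G(\gamma)}{\lambda_{\min}^{1/2}(\mathbf{K})}.
\end{aligned} \tag{33}$$

It is not difficult to show using McDiarmid's inequality that for any $\delta > 0$, with
probability at least $1 - \delta$, the following holds (Lemma 4, (Huang et al., 2006a)):

$$G(\gamma) \leq \kappa^{1/2} \sqrt{\frac{B'^2}{m} + \frac{1}{n}} \Big( 1 + \sqrt{2 \log \frac{2}{\delta}} \Big). \tag{34}$$

This combined with Inequality 33 yields the statement of the proposition. □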

The following theorem provides a bound on the difference between the generalization
error of the hypothesis returned by a kernel-based regularization algorithm when trained
on the true distribution, and the one trained on the sample bias-corrected by KMM.

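Theorem 4. Let $K$ be a strictly positive definite symmetric universal kernel. Let
$h_{\gamma}$ be the hypothesis returned by the regularization algorithm based on
$N(\cdot) = \|\cdot\|_K^2$ using $S_{\gamma/m}$ and $h_{\hat{\gamma}'}$ the one returned
after training the same algorithm on $S_{\hat{\gamma}'/m}$. Then, for any $\delta > 0$, with
probability at least $1 - \delta$, the difference in generalization error of these
hypotheses is bounded as follows

$$|R(h_{\gamma}) - R(h_{\hat{\gamma}'})| \leq \frac{\sigma^2 \kappa\, \lambda_{\max}^{1/2}(\mathbf{K})}{\lambda} \Big( \frac{\epsilon B'}{\sqrt{m}} + \frac{\kappa^{1/2}}{\lambda_{\min}^{1/2}(\mathbf{K})} \sqrt{\frac{B'^2}{m} + \frac{1}{n}} \Big( 1 + \sqrt{2 \log \frac{2}{\delta}} \Big) \Big).$$

For $\epsilon = 0$, the bound becomes

$$|R(h_{\gamma}) - R(h_{\hat{\gamma}'})| \leq \frac{\sigma^2 \kappa^{3/2}\, \mathrm{cond}^{1/2}(\mathbf{K})}{\lambda} \sqrt{\frac{B'^2}{m} + \frac{1}{n}} \Big( 1 + \sqrt{2 \log \frac{2}{\delta}} \Big).$$
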

Proof. The result follows from Proposition 2 and the bound of Proposition 4. □ 

Comparing this bound for $\epsilon = 0$ with the $l_2$ bound of Theorem 2, we first note
that $B$ and $B'$ are essentially related modulo the constant $\Pr[s = 1]$ which is not
included in the cluster-based reweighting. Thus, the cluster-based convergence is of the
order $O\big( \lambda_{\max}^{1/2}(\mathbf{K})\, B^2 \sqrt{\frac{\log m'}{p_0\, n\, m}} \big)$
and the KMM convergence of the order
$O\big( \mathrm{cond}^{1/2}(\mathbf{K})\, \frac{B'}{\sqrt{m}} \big)$.
Taking the ratio of the former over the latter and noticing $p_0^{-1} \approx O(B)$, we
obtain the expression
$O\big( \sqrt{\frac{\lambda_{\min}(\mathbf{K})\, B \log m'}{n}} \big)$. Thus, for
$n > \lambda_{\min}(\mathbf{K})\, B \log(m')$ the convergence of the cluster-based bound is
more favorable, while for other values the KMM bound converges faster.



5 Experimental Results 



In this section, we will compare the performance of the cluster-based reweighting tech- 
nique and the KMM technique empirically. We will first discuss and analyze the prop- 
erties of the clustering method and our particular implementation. 

The analysis of Section 4.1 deals with discrete points possibly resulting from the
use of a quantization or clustering technique. However, due to the relatively small size
of the public training sets available, clustering could leave us with few cluster
representatives to train with. Instead, in our experiments, we only used the clusters to
estimate sampling probabilities and applied these weights to the full set of training
points. As the following proposition shows, the $l_1$ and $l_2$ distance bounds of
Proposition 3 do not change significantly so long as the cluster size is roughly uniform and
the sampling probability is the same for all points within a cluster. We will refer to this
as the clustering assumption. In what follows, let $\Pr[s = 1 \mid C_i]$ designate the
sampling probability for all $x \in C_i$. Finally, define
$q(C_i) = \Pr[s = 1 \mid C_i]$ and $\hat{q}(C_i) = |C_i \cap S| / |C_i \cap U|$.

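Proposition 5. Let $B = \max_{i=1,\ldots,m} \max(1/q(C_i), 1/\hat{q}(C_i))$. Then, the
$l_1$ and $l_2$ distances of the distributions $\mathcal{W}$ and $\widehat{\mathcal{W}}$ can
be bounded as follows,

$$l_1(\mathcal{W}, \widehat{\mathcal{W}}) \leq B^2 \sqrt{\frac{|C_M|\, k\, (\log 2k + \log \frac{1}{\delta})}{q_0\, n\, m}}, \qquad l_2(\mathcal{W}, \widehat{\mathcal{W}}) \leq B^2 \sqrt{\frac{|C_M|\, k\, (\log 2k + \log \frac{1}{\delta})}{q_0\, n\, m^2}},$$

where $k$ is the number of clusters, $q_0 = \min q(C_i)$ and $|C_M| = \max_i |C_i|$.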

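Proof. By definition of the $l_2$ distance,

$$\begin{aligned}
l_2^2(\mathcal{W}, \widehat{\mathcal{W}})
&= \frac{1}{m^2} \sum_{i=1}^{k} \sum_{x \in C_i} \Big( \frac{1}{p(x)} - \frac{1}{\hat{p}(x)} \Big)^2
= \frac{1}{m^2} \sum_{i=1}^{k} \sum_{x \in C_i} \Big( \frac{1}{q(C_i)} - \frac{1}{\hat{q}(C_i)} \Big)^2 \\
&\leq \frac{B^4 |C_M|}{m^2} \sum_{i=1}^{k} \max_i \big( q(C_i) - \hat{q}(C_i) \big)^2.
\end{aligned}$$

The right-hand side of the first line follows from the clustering assumption and the
inequality then follows from exactly the same steps as in Proposition 3 and factoring
away the sum over the elements of $C_i$. Finally, it is easy to see that the
$\max_i (q(C_i) - \hat{q}(C_i))$ term can be bounded just as in Lemma 3 using a uniform
convergence bound; however, now the union bound is taken over the clusters rather than
unique points. □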

Note that when the cluster size is uniform, then $|C_M|\, k = m$, and the bound above leads
to an expression similar to that of Proposition 3.

We used the leaves of a decision tree to define the clusters. A decision tree selects
binary cuts on the coordinates of $x \in X$ that greedily minimize a node impurity measure,
e.g., MSE for regression (Breiman et al., 1984). Points with similar features and
labels are clustered together in this way with the assumption that these will also have
similar sampling probabilities.

Table 1. Normalized mean-squared error (NMSE) for various regression data sets using
unweighted, ideal, clustered and kernel-mean-matched training sample reweightings.

Data set       |U|     |S|    n_test   Unweighted     Ideal          Clustered      KMM
ABALONE        2000    724    2177     0.654±0.019    0.551±0.032    0.623±0.034    0.709±0.122
BANK32NH       4500    2384   3693     0.903±0.022    0.610±0.044    0.635±0.046    0.691±0.055
BANK8FM        4499    1998   3693     0.085±0.003    0.058±0.001    0.068±0.002    0.079±0.013
CAL-HOUSING    16512   9511   4128     0.395±0.010    0.360±0.009    0.375±0.010    0.595±0.054
CPU-ACT        4000    2400   4192     0.673±0.014    0.523±0.080    0.568±0.018    0.518±0.237
CPU-SMALL      4000    2368   4192     0.682±0.053    0.477±0.097    0.408±0.071    0.531±0.280
HOUSING        300     116    206      0.509±0.049    0.390±0.053    0.482±0.042    0.469±0.148
KIN8NM         5000    2510   3192     0.594±0.008    0.523±0.045    0.574±0.018    0.704±0.068
PUMA8NH        4499    2246   3693     0.685±0.013    0.674±0.019    0.641±0.012    0.903±0.059

Several methods for bias correction are compared in Table 1. Each method assigns
corrective weights to the training samples. The unweighted method uses weight 1 for
every training instance. The ideal method uses weight $1/\Pr[s = 1 \mid x]$, which is optimal
but requires the sampling distribution to be known. The clustered method uses weight
$|C_i \cap U| / |C_i \cap S|$, where the clusters $C_i$ are regression tree leaves with a minimum
count of 4 (larger cluster sizes showed similar, though declining, performance). The
KMM method uses the approach of Huang et al. (2006b) with a Gaussian kernel and
parameters $\sigma = \sqrt{d/2}$ for $x \in \mathbb{R}^d$, $B = 1000$, $\epsilon = 0$. Note that we know of no
principled way to do cross-validation with KMM since it cannot produce weights for a
held-out set (Sugiyama et al., 2008).
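To make the KMM setup more concrete, the sketch below builds the two quantities that enter the kernel mean matching objective as it is usually stated: the kernel matrix $K_{ij} = k(x_i, x_j)$ over the training points and the vector $\kappa_i = \frac{m}{n} \sum_j k(x_i, x'_j)$ over the unlabeled points, with objective $\frac{1}{2} \beta^\top K \beta - \kappa^\top \beta$. The Gaussian kernel is assumed to be parametrized as $\exp(-\|x - z\|^2 / (2\sigma^2))$ with $\sigma = \sqrt{d/2}$; that parametrization, and all function names, are assumptions of this example. The constrained quadratic program itself ($0 \le \beta_i \le B$, $|\sum_i \beta_i - m| \le m\epsilon$) is not solved here.

```python
import math


def gaussian_kernel(x, z, sigma):
    """Gaussian kernel exp(-||x - z||^2 / (2 sigma^2)) (assumed parametrization)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


def kmm_inputs(train, unlabeled):
    """Build the kernel matrix K and the vector kappa of the KMM objective."""
    d = len(train[0])
    sigma = math.sqrt(d / 2.0)  # kernel width quoted in the text
    m, n = len(train), len(unlabeled)
    K = [[gaussian_kernel(xi, xj, sigma) for xj in train] for xi in train]
    kappa = [
        (m / n) * sum(gaussian_kernel(xi, u, sigma) for u in unlabeled)
        for xi in train
    ]
    return K, kappa


def kmm_objective(beta, K, kappa):
    """Value of 0.5 * beta^T K beta - kappa^T beta for candidate weights beta."""
    quad = sum(beta[i] * K[i][j] * beta[j]
               for i in range(len(beta)) for j in range(len(beta)))
    return 0.5 * quad - sum(k_i * b_i for k_i, b_i in zip(kappa, beta))


train = [[0.0, 0.0], [1.0, 0.0]]
unlabeled = train + [[0.0, 1.0]]
K, kappa = kmm_inputs(train, unlabeled)
```

Since $\kappa$ is computed only at the training points, the resulting weights are defined only on those points, which is consistent with the remark that KMM cannot produce weights for a held-out set.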

The regression datasets are from LIAAD$^4$ and are sampled with

\[
P[s = 1 \mid x] = \frac{e^{v}}{1 + e^{v}},
\quad \text{where} \quad
v = \frac{4\, w \cdot (x - \bar{x})}{\sigma_{w \cdot (x - \bar{x})}},
\]

$x \in \mathbb{R}^d$ and $w \in \mathbb{R}^d$ chosen at random from $[-1, 1]^d$. In our
experiments, we chose ten random projections $w$ and reported results with the $w$, for
each data set, that maximizes the difference between the unweighted and ideal methods
over repeated sampling trials. In this way, we selected bias samplings that are good
candidates for bias correction estimation.
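The biased sampling procedure can be sketched as below. The example assumes a logistic selection probability $e^v/(1+e^v)$ in which $v$ is four times the standardized projection of $x$ onto a direction $w$; this exact form, the use of the population standard deviation, and the function names are assumptions made for illustration.

```python
import math
import random
import statistics


def selection_probabilities(X, w):
    """Return P[s = 1 | x] for each row of X under the assumed logistic model."""
    proj = [sum(wi * xi for wi, xi in zip(w, x)) for x in X]
    mean = statistics.fmean(proj)
    std = statistics.pstdev(proj)
    if std == 0.0:
        raise ValueError("projection is constant; selection would be uniform")
    probs = []
    for p in proj:
        v = 4.0 * (p - mean) / std
        # Numerically stable evaluation of e^v / (1 + e^v).
        if v >= 0.0:
            probs.append(1.0 / (1.0 + math.exp(-v)))
        else:
            probs.append(math.exp(v) / (1.0 + math.exp(v)))
    return probs


def draw_biased_sample(X, w, rng):
    """Return the indices of the points of X selected into the biased sample."""
    probs = selection_probabilities(X, w)
    return [j for j, p in enumerate(probs) if rng.random() < p]


rng = random.Random(0)
X = [[0.0], [1.0], [2.0]]
w = [1.0]  # in the experiments w would be drawn uniformly from [-1, 1]^d
probs = selection_probabilities(X, w)
selected = draw_biased_sample(X, w, rng)
```

A point whose projection equals the mean is selected with probability 1/2, and points far along $w$ are selected almost surely, which is what produces a strongly biased training sample.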

For our experiments, we used a version of SVR available from LibSVM$^5$ that can
take as input weighted samples, with parameter values $C = 1$ and $\epsilon = 0.1$, combined
with a Gaussian kernel with parameter $\sigma = \sqrt{d/2}$. We report results using normalized
mean-squared error (NMSE),
$\frac{1}{n_{\text{test}}} \sum_{i=1}^{n_{\text{test}}} \frac{(y_i - \hat{y}_i)^2}{\sigma_y^2}$,
and provide mean and standard deviations for ten-fold cross-validation.
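For reference, a minimal sketch of the NMSE computation is given below. It assumes that the normalization $\sigma_y^2$ is the population variance of the test labels; whether the original experiments used the population or the sample variance is not stated, so this choice is an assumption, as is the function name.

```python
import statistics


def nmse(y_true, y_pred):
    """Mean-squared error divided by the variance of the true labels."""
    if len(y_true) != len(y_pred) or len(y_true) < 2:
        raise ValueError("need two equally sized sequences with at least 2 items")
    var_y = statistics.pvariance(y_true)  # population variance (assumption)
    if var_y == 0.0:
        raise ValueError("labels are constant; NMSE is undefined")
    mse = sum((y - p) ** 2 for y, p in zip(y_true, y_pred)) / len(y_true)
    return mse / var_y


y = [1.0, 2.0, 3.0, 4.0]
perfect = nmse(y, y)
baseline = nmse(y, [2.5] * len(y))  # always predicting the label mean
```

Under this normalization a predictor that always outputs the mean label scores 1, so the values below 1 in Table 1 indicate an improvement over that trivial baseline.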

Our results show that reweighting with more reliable counts, due to clustering, can
be effective in the problem of sample bias correction. These results also confirm the
dependence that our theoretical bounds exhibit on the quantity $n_0$. The results obtained
using KMM seem to be consistent with those reported by the authors of this technique.$^6$

6 Conclusion 

We presented a general analysis of sample selection bias correction and gave bounds
analyzing the effect of an estimation error on the accuracy of the hypotheses returned.
The notion of distributional stability and the techniques presented are general and can
be of independent interest for the analysis of learning algorithms in other settings. In
particular, these techniques apply similarly to other importance weighting algorithms
and can be used in other contexts such as that of learning in the presence of uncertain
labels. The analysis of the discriminative method of (Bickel et al., 2007) for the problem
of covariate shift could perhaps also benefit from this study.

$^4$ www.liaad.up.pt/~ltorgo/Regression/DataSets.html
$^5$ www.csie.ntu.edu.tw/~cjlin/libsvmtools
$^6$ We thank Arthur Gretton for discussion and help in clarifying the choice of the parameters and
design of the KMM experiments reported in (Huang et al., 2006b), and for providing the code
used by the authors for comparison studies.
A Proof of Theorem 3



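Proof. Assume that $\mu(p) = \mu(q)$ for two probability distributions $p$ and $q$ in $\mathcal{P}$. It is
known that if $\mathrm{E}_{x \sim p}[f(x)] = \mathrm{E}_{x \sim q}[f(x)]$ for any $f \in C(X)$, then $p = q$. Let $f \in C(X)$
and fix $\epsilon > 0$. Since $K$ is universal, there exists a function $g$ induced by $K$ such that
$\|f - g\|_\infty \le \epsilon$. $\mathrm{E}_{x \sim p}[f(x)] - \mathrm{E}_{x \sim q}[f(x)]$ can be rewritten as

\[
\mathrm{E}_{x \sim p}[f(x) - g(x)] + \mathrm{E}_{x \sim p}[g(x)] - \mathrm{E}_{x \sim q}[g(x)] + \mathrm{E}_{x \sim q}[g(x) - f(x)]. \tag{36}
\]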



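Since $|\mathrm{E}_{x \sim p}[f(x) - g(x)]| \le \mathrm{E}_{x \sim p} |f(x) - g(x)| \le \|f - g\|_\infty \le \epsilon$ and similarly
$|\mathrm{E}_{x \sim q}[f(x) - g(x)]| \le \epsilon$,

\[
\Bigl| \mathrm{E}_{x \sim p}[f(x)] - \mathrm{E}_{x \sim q}[f(x)] \Bigr| \le \Bigl| \mathrm{E}_{x \sim p}[g(x)] - \mathrm{E}_{x \sim q}[g(x)] \Bigr| + 2\epsilon. \tag{37}
\]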



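Since $g$ is induced by $K$, there exists $w \in F$ such that for all $x \in X$, $g(x) = \langle w, \Phi(x) \rangle$.
Since $F$ is separable, it admits a countable orthonormal basis $(e_n)_{n \in \mathbb{N}}$. For $n \in \mathbb{N}$,
let $w_n = \langle w, e_n \rangle$ and $\Phi_n(x) = \langle \Phi(x), e_n \rangle$. Then, $g(x) = \sum_{n=0}^{\infty} w_n \Phi_n(x)$. For each
$N \in \mathbb{N}$, consider the partial sum $g_N(x) = \sum_{n=0}^{N} w_n \Phi_n(x)$. By the Cauchy-Schwarz
inequality,

\[
|g_N(x)| \le \Bigl\| \sum_{n=0}^{N} w_n e_n \Bigr\|_2^{1/2} \, \Bigl\| \sum_{n=0}^{N} \Phi_n(x) e_n \Bigr\|_2^{1/2} \le \|w\|_2^{1/2} \, \|\Phi(x)\|_2^{1/2}. \tag{38}
\]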



Since $K$ is universal, it is continuous and thus $\Phi$ is also continuous (Steinwart, 2002).
Thus $x \mapsto \|\Phi(x)\|_2$ is a continuous function over the compact $X$ and admits an upper
bound $B \ge 0$. Thus, $|g_N(x)| \le \sqrt{\|w\|_2 B}$. The integral $\int |\sqrt{\|w\|_2 B}| \, dp$ is clearly well
defined and equals $\sqrt{\|w\|_2 B}$. Thus, by the Lebesgue dominated convergence theorem,
the following holds:

\[
\mathrm{E}_{x \sim p}[g(x)] = \int \sum_{n=0}^{\infty} w_n \Phi_n(x) \, dp(x) = \sum_{n=0}^{\infty} w_n \int \Phi_n(x) \, dp(x). \tag{39}
\]

By definition of $\mathrm{E}_{x \sim p}[\Phi(x)]$, the last term is the inner product of $w$ and that term. Thus,

\[
\mathrm{E}_{x \sim p}[g(x)] = \bigl\langle w, \mathrm{E}_{x \sim p}[\Phi(x)] \bigr\rangle = \langle w, \mu(p) \rangle. \tag{40}
\]



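A similar equality holds with the distribution $q$, thus,

\[
\mathrm{E}_{x \sim p}[g(x)] - \mathrm{E}_{x \sim q}[g(x)] = \langle w, \mu(p) - \mu(q) \rangle = 0.
\]

Thus, Inequality 37 can be rewritten as

\[
\Bigl| \mathrm{E}_{x \sim p}[f(x)] - \mathrm{E}_{x \sim q}[f(x)] \Bigr| \le 2\epsilon, \tag{41}
\]

for all $\epsilon > 0$. This implies $\mathrm{E}_{x \sim p}[f(x)] = \mathrm{E}_{x \sim q}[f(x)]$ for all $f \in C(X)$ and the
injectivity of $\mu$. □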



